Abstract

This paper investigates the use of a language-independent model for
named entity recognition based on iterative learning in a co-training
fashion, using word-internal and contextual information as independent
evidence sources. Its bootstrapping process begins with only seed
entities and seed contexts extracted from the provided annotated corpus.
F-measure exceeds 77 in Spanish and 72 in Dutch.

1.  Introduction 

Our aim has been to build a maximally language-independent system for
named-entity recognition using minimal supervision or knowledge of the
source language. The core model utilized, extended and evaluated here is
based on Cucerzan and Yarowsky (1999). It assumes that only an entity
exemplar list is provided as a bootstrapping seed set.

For the particular task of CoNLL-2002, the seed entities are extracted
from the provided annotated corpus. As a consequence, the seed examples
may be ambiguous, and the system must therefore handle seeds with a
probability distribution over entity classes rather than unambiguous
seeds. Another consequence is that extracting only the entity seeds from
the annotated text does not use the full potential of the training data,
because it ignores contextual information. For example, Bosnia appears
labeled 9 times as LOC and 5 times as ORG, and the only information that
would be used is that the word Bosnia denotes a location 64% of the time
and an organization 36% of the time, but not in which contexts it is
labeled one way or the other. To correct this problem, an improved system
also uses context seeds if available (for this particular task, they are
extracted from the annotated corpus). Because the representations of
entity candidates and contexts are identical, this modification imposes
only minor changes in algorithm and code.
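
As an illustration of seeds that carry a probability distribution rather than a single label, the following minimal sketch derives such distributions from labeled occurrences. The function name `seed_distributions` and the input format (pairs of entity string and label) are assumptions made for this example; the paper does not specify how the seed extraction is implemented.

```python
from collections import Counter, defaultdict

def seed_distributions(labeled_entities):
    """Map each entity string to a probability distribution over classes."""
    counts = defaultdict(Counter)
    for text, label in labeled_entities:
        counts[text][label] += 1
    return {
        text: {label: n / sum(c.values()) for label, n in c.items()}
        for text, c in counts.items()
    }

# The Bosnia example from the text: 9 occurrences as LOC and 5 as ORG.
observations = [("Bosnia", "LOC")] * 9 + [("Bosnia", "ORG")] * 5
seeds = seed_distributions(observations)
print({label: round(p, 2) for label, p in seeds["Bosnia"].items()})
# prints {'LOC': 0.64, 'ORG': 0.36}
```

Context seeds could be collected the same way by pairing each context string with the label of the entity it surrounds.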

Because the core model has been presented in detail in Cucerzan and
Yarowsky (1999), this paper focuses primarily on the modifications of the
algorithm and its adaptation to the current task. The major modifications
besides the seed handling include a different method of smoothing the
distributions along the paths in the tries, a new 'soft' discourse
segmentation method, and the use of a different labeling methodology, as
required by the current task, i.e. no overlapping entities are allowed
(for example, the correct labeling of colegio San Juan Bosco de Merida is
considered to be ORG(colegio San Juan Bosco) de LOC(Merida) rather than
ORG(colegio PER(San Juan Bosco) de LOC(Merida))).

2.  Entity-Internal  Information 

Two types of entity-internal evidence are used in a unified framework.
The first consists of the prefixes and suffixes of candidate entities.
For example, in Spanish, names ending in -ez (e.g. Alvarez and Gutierrez)
are often surnames; names ending in -ia are often locations (e.g.
Austria, Australia, and Italia). Likewise, common beginnings and endings
of multiword entities (e.g. Asociacion de la Prensa de Madrid and
Asociacion para el Desarrollo Rural Jerez-Sierra Suroeste, which are both
organizations) are good indicators for entity type.

3.  Contextual  Information 

An entity's left and right context provides an essentially independent
evidence source for model bootstrapping. This information is also
important for entities that do not have a previously seen word structure,
are of foreign origin, or are polysemous. Rather than using word bigrams
or trigrams, the system handles the context in the same way it handles
the entities, allowing for variable-length contexts. The advantages of
this unified approach are presented in the next section.

4.  A Unified Structure for both Internal and Contextual Information

Character-based tries provide an effective, efficient and flexible data
structure for storing both contextual and morphological patterns and
statistics. They are very compact representations and support a natural
hierarchical smoothing procedure for distributional class statistics. In
our implementation, each terminal or branching node contains a
probability distribution which encodes the conditional probability of
entity classes given the sistring corresponding to the path from the root
to that node. Each such distribution also has two standard classes, named
"questionable" (unassigned probability mass in terms of entity classes,
to be motivated below) and "non-entity" (common words).

    ... organizada por la Concejalia de Cultura , tienen un ...
    [LEFT CONTEXT]  [PREFIX / SUFFIX]  [RIGHT CONTEXT]

Figure 1: An example of entity candidate and context and the way the
information is introduced in the four tries (arrows indicate the
direction letters are considered)

Two tries (denoted PT and ST) are used for the internal representation of
the entity candidates in prefix and suffix form, respectively. Two other
tries are used for left (LCT) and right (RCT) context. Right contexts are
introduced in RCT by considering their component letters from left to
right; left contexts are introduced in LCT using the reversed order of
letters, from right to left (Figure 1). In this way, the system handles
variable-length contexts and attempts to match in each instance the
longest known context (longer contexts are more reliable than short
contexts, and the longer-context statistics incorporate the
shorter-context statistics through smoothing along the paths in the
tries).
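
The trie storage and longest-match lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names `TrieNode` and `CharTrie` are hypothetical, class counts are kept at every node rather than only at terminal or branching nodes, no smoothing is applied, and the `reverse` flag (letters read right to left) is assumed to serve both the suffix trie and the left-context trie.

```python
from collections import Counter

class TrieNode:
    def __init__(self):
        self.children = {}
        self.counts = Counter()  # class label -> weight seen along this path

class CharTrie:
    def __init__(self, reverse=False):
        self.root = TrieNode()
        self.reverse = reverse   # True: read letters right to left

    def _chars(self, text):
        return text[::-1] if self.reverse else text

    def add(self, text, label, weight=1.0):
        """Insert a string and credit its label to every node on the path."""
        node = self.root
        node.counts[label] += weight
        for ch in self._chars(text):
            node = node.children.setdefault(ch, TrieNode())
            node.counts[label] += weight

    def longest_match(self, text):
        """Return (matched length, class distribution) at the deepest known node."""
        node, depth = self.root, 0
        for ch in self._chars(text):
            if ch not in node.children:
                break
            node = node.children[ch]
            depth += 1
        total = sum(node.counts.values())
        dist = {c: n / total for c, n in node.counts.items()} if total else {}
        return depth, dist

# Suffix trie built from the Spanish examples given in Section 2.
st = CharTrie(reverse=True)
for name in ("Austria", "Australia", "Italia"):
    st.add(name, "LOC")
for name in ("Alvarez", "Gutierrez"):
    st.add(name, "PER")

depth, dist = st.longest_match("Bolivia")  # unseen name ending in -ia
print(depth, dist)                         # prints 2 {'LOC': 1.0}
```

The same class with `reverse=False` would serve for the prefix and right-context tries.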

The tries are linked together into two bipartite structures, PT with LCT,
and ST with RCT, by attaching to each node a list of links to the entity
candidates with which, or the contexts in which, the sistring
corresponding to that node has been seen in the text (Figure 2).

5.  Unassigned  Probability  Mass 

When faced with a highly skewed observed class distribution for which
there is little confidence due to small sample size, a typical response
is to back off or smooth to the more general class distribution.
Unfortunately, this representation makes it difficult to distinguish a
backed-off conditional distribution from one based on a large sample (and
hence estimated with confidence). We address this problem by explicitly
representing the uncertainty as a class, called "questionable".
Probability mass continues to be distributed among the primary entity
classes in proportion to the observed distribution in the data, but with
a total sum that reflects the confidence in the distribution and is equal
to

$$1 - P(\mathit{quest}).$$

Incremental learning essentially becomes the process of gradually
shifting probability mass from questionable to one of the primary
classes.

Figure 2: An example of links between the Suffix Trie and the Right
Context Trie for the entity candidate Austria and some of its right
contexts as observed in the corpus (<, Holanda>, <, hizo>, <a Chirac>)
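
To make the role of the "questionable" class concrete, the sketch below converts raw class counts into a distribution that reserves part of the mass for uncertainty. The confidence function n / (n + k) and the parameter `k` are assumptions chosen only for illustration; the paper does not state how confidence is derived from sample size.

```python
def with_questionable(counts, k=5.0):
    """Turn raw class counts into a distribution with a 'questionable' class.

    The confidence n / (n + k) is an illustrative assumption, not the
    paper's formula; larger k demands more observations before trusting
    the observed distribution.
    """
    n = sum(counts.values())
    if n == 0:
        return {"questionable": 1.0}
    confidence = n / (n + k)
    # Primary classes keep their observed proportions, scaled by confidence.
    dist = {label: confidence * v / n for label, v in counts.items()}
    dist["questionable"] = 1.0 - confidence
    return dist

small = with_questionable({"LOC": 2})               # skewed, only 2 samples
large = with_questionable({"LOC": 190, "ORG": 10})  # well-supported estimate
print(round(small["questionable"], 3), round(large["questionable"], 3))
```

With these inputs the two-sample distribution keeps most of its mass in "questionable", while the 200-sample distribution assigns almost all of it to the primary classes.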

6.  Smoothing 

The probability of an entity candidate or context being, or indicating, a
certain type of entity is computed along the path from the root to the
node in the trie structure described above. In this way, effective
smoothing can be realized for rare entities or contexts. A smoothing
formula taking advantage of the distributional representation of
uncertainty is presented below.

For a sistring $l_1 l_2 \ldots l_n$ (i.e. the path in the trie is
$\mathit{root} - l_1 - l_2 - \cdots - l_n$), the general smoothing model
for the conditional class probabilities is given by the recursive
formula:

where $Z$ is a normalization factor and $\beta \in [0, 1]$, $\alpha > 1$
are model parameters.

7.  One  Sense  per  Discourse 

Clearly, in many cases, the context for only one instance of an entity
and the word-internal information are not enough to make a classification
decision. But, as noted by Katz (1996), a newly introduced entity will be
repeated, "if not for breaking the monotonous effect of pronoun use, then
for emphasis and clarity". We use this property in conjunction with the
one sense per discourse tendency noted by Gale et al. (1992). The latter
paradigm is not directly usable when analyzing a large corpus in which
there are no document boundaries, like the one provided for Spanish.
Therefore, a segmentation process needs to be employed, so that all the
instances of a name in a segment have a high probability of belonging to
the same class. Our approach is to consider a 'soft' segmentation, which
is word-dependent and does not compute topic/document boundaries but
regions for which the contextual information for all instances of a word
can be used jointly when making a decision (Figure 3). This is viewed as
an alternative to the classical topic segmentation approach and can be
used in conjunction with a language-independent segmentation system like
the one presented by Richmond et al. (1997).

Figure 3: Using contextual clues from all instances of an entity
candidate in the corpus. Each instance is depicted as a disc with the
diameter representing the confidence of the classification of that
instance using word-internal and local contextual information.

After estimating the class probability distributions for all instances of
entity candidates in the corpus, a re-estimation step is employed. The
probability of an entity class $c_j$ given an entity candidate $w$ at
position $pos_i$ is re-computed using the formula:

$$P'(c_j \mid w, pos_i) = \frac{1}{Z} \sum_{k=1}^{n} P(c_j \mid w, pos_k) \cdot sim(pos_i, pos_k) \cdot conf(w, pos_k) \qquad (2)$$

where $pos_1, \ldots, pos_n$ are the positions of all instances of $w$ in
the corpus, $sim$ is the positional similarity, encoding the physical
distance and topic (if topic or document boundary information exists),
$conf$ is the classification confidence of each instance (inversely
proportional to $P(\mathit{quest} \mid w, pos_k)$), and $Z$ is a
normalization factor.
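
The re-estimation step can be sketched in a few lines. The concrete choices below are assumptions made for illustration only: `sim` is taken to decay exponentially with the distance between positions (the `decay` parameter is hypothetical), and `conf` is taken as one minus the questionable mass, a simple decreasing function of it. The paper specifies neither function in closed form.

```python
import math

def reestimate(instances, i, decay=0.01):
    """Re-estimate the class distribution of instance i of a word w.

    instances: list of (position, distribution) pairs, one per occurrence
    of w; each distribution maps class labels (including 'questionable')
    to probabilities.
    """
    pos_i = instances[i][0]
    scores = {}
    for pos_k, dist_k in instances:
        sim = math.exp(-decay * abs(pos_i - pos_k))   # assumed similarity
        conf = 1.0 - dist_k.get("questionable", 0.0)  # assumed confidence
        for label, p in dist_k.items():
            scores[label] = scores.get(label, 0.0) + p * sim * conf
    z = sum(scores.values())                          # normalization factor Z
    if z == 0:
        return dict(instances[i][1])
    return {label: s / z for label, s in scores.items()}

occurrences = [
    (100, {"LOC": 0.2, "ORG": 0.2, "questionable": 0.6}),    # weak evidence
    (130, {"LOC": 0.9, "ORG": 0.05, "questionable": 0.05}),  # strong, nearby
]
updated = reestimate(occurrences, 0)
print({label: round(p, 2) for label, p in updated.items()})
```

In this toy case the confident nearby instance pulls the weak instance toward LOC, which is the intended effect of using all instances of a word jointly.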

8.  Entity  Identification  /  Multiple-Word  Entities 

There are two major alternatives for handling multiple-word entities. A
first approach is to tokenize the text and classify each individual word
as being or not being part of an entity, followed by an entity assembly
algorithm. A second alternative is to consider a chunking algorithm that
identifies entity candidates and classifies each of the chunks as Person,
Location, Organization, Miscellaneous, or Non-entity. We use this second
alternative, but in a 'soft' form; i.e. each word can be included in
multiple competing chunks (entity candidates). This approach is suitable
for all languages including Chinese, where no word separators are used
(the entity candidates are determined by specifying starting and ending
character positions). Another advantage of this method is that single and
multiple-word entities can be handled in the same way.

Figure 4: The structure of an entity candidate represented as an
automaton with two final states

The boundaries of entity candidates are determined by a few simple rules
incorporated into three discriminators: is_B_candidate tests if a word
can represent the beginning of an entity, is_I_candidate tests if a word
can be an internal part of an entity, and is_E_candidate tests if a word
can be the end of an entity. These discriminators use simple heuristics
based on capitalization, position in sentence, length of the word, usage
of the word in the set of seed entities, and co-occurrence with
uncapitalized instances of the same word. A string is considered an
entity candidate if it has the structure shown in Figure 4.
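
The 'soft' chunking can be sketched as an enumeration of overlapping token spans. Everything concrete here is an assumption for illustration: the capitalization tests and the small connector list stand in for the paper's heuristics, `is_i_candidate` is taken to accept internal words and `is_e_candidate` ending words, and a candidate is assumed to be either a single beginning word or a beginning word, any number of internal words, and an ending word.

```python
def is_b_candidate(word):
    # Assumed heuristic: an entity may start with a capitalized word.
    return word[:1].isupper()

def is_i_candidate(word):
    # Assumed heuristic: internal words are capitalized or short connectors.
    return word[:1].isupper() or word.lower() in {"de", "la", "del", "el", "para"}

def is_e_candidate(word):
    # Assumed heuristic: an entity may end with a capitalized word.
    return word[:1].isupper()

def entity_candidates(tokens, max_len=8):
    """Return all (start, end) token spans, end exclusive, that qualify."""
    spans = []
    for start, word in enumerate(tokens):
        if not is_b_candidate(word):
            continue
        spans.append((start, start + 1))        # single-word candidate
        end = start + 1
        while end < len(tokens) and end - start < max_len:
            if is_e_candidate(tokens[end]):
                spans.append((start, end + 1))  # multi-word candidate
            if not is_i_candidate(tokens[end]):
                break                           # span cannot be extended
            end += 1
    return spans

tokens = "la Concejalia de Cultura , tienen".split()
print(entity_candidates(tokens))
# prints [(1, 2), (1, 4), (3, 4)]
```

The output contains competing candidates (Concejalia, Concejalia de Cultura, Cultura) that share words, which is what a later conflict-resolution step has to settle.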

An extension of the system also makes use of Part-of-Speech (POS) tags.
We used the provided POS annotation in Dutch (Daelemans et al., 1996) and
a minimally supervised tagger (Yarowsky and Cucerzan, 2002) for Spanish
to restrict the space of words accepted by the discriminators (e.g.
is_B_candidate rejects prepositions, conjunctions, pronouns, adverbs, and
those determiners that are the first word in the sentence).

9.  Algorithm  Structure 

The core algorithm can be divided into eight stages, which are summarized
in Figure 5. The bootstrapping stage (5) uses the initial or current
entity assignments to estimate the class conditional distributions for
both entities and contexts along their trie paths, and then re-estimates
the distributions of the contexts/entity-candidates to which they are
linked, recursively, until all accessible nodes are reached, as presented
in Cucerzan and Yarowsky (1999).


1. Extract the entity (and context) seed sets from the annotated data
2. Read the text to be annotated and extract all entity-candidates
3. Extract the sets LC and RC of all contexts of entity candidates
4. Build the tries using all individual words and entity candidates,
   and all instances of the elements LC and RC from the text
5. Apply the bootstrapping procedure using the seed data
6. Classify each entity-candidate in isolation
7. Re-classify each entity-candidate by using formula (2)
8. Resolve conflicts between competing entity candidates

Figure 5: Algorithm structure
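
The last stage in Figure 5, resolving conflicts between competing entity candidates, is not detailed in the text. One simple possibility, shown purely as an assumption, is a greedy selection that accepts candidates in order of decreasing confidence and skips any candidate overlapping an already accepted one. The function name, the tuple layout and the confidence values are hypothetical.

```python
def resolve_conflicts(candidates):
    """Greedily pick non-overlapping candidates, highest confidence first.

    candidates: list of (start, end, label, confidence), end exclusive.
    """
    chosen = []
    taken = set()
    for start, end, label, conf in sorted(candidates, key=lambda c: -c[3]):
        if label == "NON-ENTITY":
            continue
        span = set(range(start, end))
        if span & taken:
            continue              # overlaps an already accepted entity
        taken |= span
        chosen.append((start, end, label))
    return sorted(chosen)

# Tokens: colegio(0) San(1) Juan(2) Bosco(3) de(4) Merida(5).
# The confidence values are invented for the example.
cands = [
    (0, 4, "ORG", 0.8),   # colegio San Juan Bosco
    (1, 4, "PER", 0.7),   # San Juan Bosco
    (5, 6, "LOC", 0.9),   # Merida
    (0, 6, "ORG", 0.5),   # colegio San Juan Bosco de Merida
]
print(resolve_conflicts(cands))
# prints [(0, 4, 'ORG'), (5, 6, 'LOC')]
```

With these invented confidences the result matches the non-overlapping labeling required by the task, ORG(colegio San Juan Bosco) de LOC(Merida).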


10.  Results 

We compare the results of two variants of the described model on the
development and test sets provided (Table 1). The first one uses only
exemplar entity and context seeds extracted from the training corpus. The
second also employs POS information to rule out unlikely entity
candidates.
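
For reference, the F-measure reported in Table 1 is the standard harmonic mean of precision and recall (beta = 1). The short sketch below recomputes two entries of the Spanish test-set results from their precision and recall; the function name `f_measure` is chosen for this example.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F with beta = 1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Spanish test set, LOC row, without POS information (Table 1).
f_loc = f_measure(78.62, 73.62)
# Spanish test set, Overall row, with POS information (Table 1).
f_overall = f_measure(78.19, 76.14)
print(round(f_loc, 2), round(f_overall, 2))
# prints 76.04 77.15
```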

The system was built and tested initially utilizing only the provided
Spanish data. The parameters were estimated using an 80/20 split of the
training data (esp.train and ned.train). The dev-test data (testa) were
not used during the parameter estimation phase. The programs were run
once on the final test data (files testb). We allocated only one
person-day to adapt the system for Dutch and tune the parameters to this
language in order to show functional language independence. We opted not
to make a detailed study of parameter variation on test data to avoid any
potential for tuning to this resource and preserve its value for future
system development.

The following table further details the types of errors made by the
algorithm (full system on Spanish dev-set). E1 represents the number of
over-generated and under-generated entities in the precision and recall
rows (respectively). E2 represents the number of entities with correctly
identified boundaries, but wrong classifications.

Precision     43 + 153    ...          87 + 341     76 + 170
Recall        43 + 123    34 + 310    112 + 279     22 + 70

Because our system takes seed lists rather than annotated text as input,
additional entity lists can be used by the system. By employing such
lists of countries, major cities, frequent person names and major
companies (extracted from the web), significant improvements can be
obtained (preliminary tests show as much as 2.5 F-measure improvement on
an 80/20 split of the training data in Dutch).

11.  Conclusion 

This paper has presented and evaluated an extended bootstrapping model
based on Cucerzan and Yarowsky (1999) that uses a unified framework of
both entity-internal and contextual evidence. Starting only with entity
and context seeds extracted from the training data, and with the addition
of part-of-speech information, system performance exceeds 77 and 72
F-measure for Spanish and Dutch respectively.

Spanish, dev-set
[Only fragments of this block survive: 84.29, 86.07, 83.96 and 75.08.]

Spanish, test set
               without POS information            with POS information
         Precision    Recall   F-meas.   Precision    Recall   F-meas.
LOC          78.62     73.62     76.04       79.66     73.34     76.37
MISC         63.73     38.24     47.79       64.22     38.53     48.16
ORG          74.86     78.50     76.64       76.79     81.07     78.87
PER          80.63     87.21     83.79       82.57     88.30     85.34
Overall      76.62     74.96     75.78       78.19     76.14     77.15

Dutch, dev-set
               without POS information            with POS information
         Precision    Recall   F-meas.   Precision    Recall   F-meas.
LOC          73.30     70.38     71.81       76.87     73.32     75.05
MISC         64.08     57.64     60.69       68.16     63.14     65.55
ORG          67.34     53.76     59.79       70.63     55.96     62.45
PER          63.17     79.94     70.57       64.99     80.51     71.92
Overall      66.10     65.01     65.55       69.14     67.84     68.49

Dutch, test set
               without POS information            with POS information
         Precision    Recall   F-meas.   Precision    Recall   F-meas.
LOC          73.65     77.56     75.55       77.72     80.54     79.11
MISC         70.10     57.29     63.05       74.67     62.34     67.95
ORG          69.78     62.14     65.74       72.12     64.88     68.31
PER          67.62     79.26     72.98       69.39     80.71     74.62
Overall      69.95     68.49     69.21       73.03     71.62     72.31

Table 1: Results on the development sets (files esp.testa and ned.testa)
and on the test sets (files esp.testb and ned.testb)